Estimating the compression fraction of an index using sampling
Data compression techniques such as null suppression
and dictionary compression are commonly used in today’s
database systems. In order to effectively leverage compression, it
is necessary to have the ability to efficiently and accurately
estimate the size of an index if it were to be compressed. Such an
analysis is critical if automated physical design tools are to be
extended to handle compression. Several database systems today
provide estimators for this problem based on random sampling.
While this approach is efficient, there is no previous work that
analyses its accuracy. In this paper, we analyse the problem of
estimating the compressed size of an index from the point of view
of worst-case guarantees. We show that the simple estimator
implemented by several database systems performs well in several
“good” cases, even though the estimator itself is agnostic to the
internals of the specific compression algorithm.
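To make the sampling-based estimator concrete, the following sketch (not taken from the paper; the function name, the serialization, and the use of zlib as a stand-in compressor are illustrative assumptions) compresses a uniform random sample of a column's values and scales the result by the inverse of the sampling fraction, treating the compressor as a black box.

```python
import random
import zlib

def estimate_compressed_size(column_values, sample_fraction=0.05,
                             compress=zlib.compress):
    """Estimate the compressed size of an index column by compressing a
    uniform random sample of its values and scaling up the result."""
    sample_size = max(1, int(len(column_values) * sample_fraction))
    sample = random.sample(column_values, sample_size)
    # Serialize the sampled values; a real system would compress the index
    # representation itself (e.g. with null suppression or dictionary
    # compression) rather than applying zlib to a byte string.
    payload = b"".join(str(v).encode() for v in sample)
    compressed_sample_size = len(compress(payload))
    # Extrapolate: scale the sample's compressed size by 1 / sample_fraction.
    # The estimator never looks inside the compressor, matching the
    # "agnostic to the internals" behaviour described above.
    return compressed_sample_size * (len(column_values) / sample_size)

if __name__ == "__main__":
    # Example: a skewed, highly compressible integer column.
    values = [random.randint(0, 9) for _ in range(100_000)]
    print(round(estimate_compressed_size(values)))
```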
In order to extend automated physical design tools to handle
compression, the compressed size of an index must be estimated both
accurately and efficiently. The naïve method of actually building and
compressing the index in order to estimate its size, while
highly accurate, is prohibitively inefficient.
Thus, we need to be able to accurately estimate the
compressed size of an index without incurring the cost of
actually compressing it. This problem is challenging because
the size of the compressed index can depend significantly on
the data distribution as well as the compression technique
used. This is in contrast with the size of an uncompressed index,
which physical database design tools can derive in a
straightforward manner from the schema
(which defines the size of the corresponding column) and the
number of rows in the table.
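By way of contrast, a back-of-the-envelope sketch of that straightforward uncompressed-size computation is shown below; the function and its per-row overhead parameter are illustrative assumptions, and a real design tool would also account for page and fill-factor overheads, which this sketch ignores.

```python
def estimate_uncompressed_index_size(row_count, key_column_sizes_bytes,
                                     per_row_overhead_bytes=0):
    """Straightforward uncompressed-size estimate: the per-row width of the
    key columns (taken from the schema), plus any fixed per-row overhead,
    multiplied by the table's row count."""
    return row_count * (sum(key_column_sizes_bytes) + per_row_overhead_bytes)

# Example: a 10-million-row index over an INT (4 bytes) and a CHAR(20) column.
print(estimate_uncompressed_index_size(10_000_000, [4, 20]))  # 240000000 bytes
```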